AITopics | articulatory feature

Articulation-Informed ASR: Integrating Articulatory Features into ASR via Auxiliary Speech Inversion and Cross-Attention Fusion

Attia, Ahmed Adel, Liu, Jing, Espy-Wilson, Carol

arXiv.org Artificial Intelligence, Oct-13-2025

Prior works have investigated the use of articulatory features as complementary representations for automatic speech recognition (ASR), but their use was largely confined to shallow acoustic models. In this work, we revisit articulatory information in the era of deep learning and propose a framework that leverages articulatory representations both as an auxiliary task and as a pseudo-input to the recognition model. Specifically, we employ speech inversion as an auxiliary prediction task, and the predicted articulatory features are injected into the model as a query stream in a cross-attention module with acoustic embeddings as keys and values. Experiments on LibriSpeech demonstrate that our approach yields consistent improvements over strong transformer-based baselines, particularly under low-resource conditions. These findings suggest that articulatory features, once sidelined in ASR research, can provide meaningful benefits when reintroduced with modern architectures.
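The fusion step described in this abstract, with articulatory features as queries and acoustic embeddings as keys and values, can be illustrated with a minimal single-head sketch. This is not the authors' implementation: it omits the learned projections, multiple heads, and the speech inversion model, and the function names and toy frames are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: articulatory frames, shape (T_q, d)
    keys, values: acoustic frames, shapes (T_k, d) and (T_k, d_v)
    Returns the fused frames (T_q, d_v) and the attention weights (T_q, T_k).
    """
    scale = math.sqrt(len(keys[0]))
    fused, weights = [], []
    for q in queries:
        # Similarity of this articulatory query to every acoustic key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of acoustic values, one output frame per query.
        fused.append([
            sum(a * v[j] for a, v in zip(attn, values))
            for j in range(len(values[0]))
        ])
    return fused, weights


# Hypothetical toy data: 2 predicted articulatory frames, 3 acoustic frames.
articulatory = [[1.0, 0.0], [0.0, 1.0]]
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused, weights = cross_attention(articulatory, acoustic, acoustic)
```

Each output frame is a mixture of acoustic frames, weighted by how well they match the corresponding articulatory query, so the output has one frame per articulatory frame rather than per acoustic frame.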

artificial intelligence, machine learning, representation, (15 more...)

2510.08585

Genre: Research Report > New Finding (0.48)

Industry: Health & Medicine (0.69)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

On the Relationship between Accent Strength and Articulatory Features

Huang, Kevin, Foley, Sean, Lee, Jihwan, Lee, Yoonjeong, Byrd, Dani, Narayanan, Shrikanth

arXiv.org Artificial Intelligence, Jul-8-2025

This paper explores the relationship between accent strength and articulatory features inferred from acoustic speech. To quantify accent strength, we compare phonetic transcriptions with dictionary-based reference transcriptions, computing the phoneme-level difference as a measure of accent strength. The proposed framework leverages recent self-supervised learning articulatory inversion techniques to estimate articulatory features. Analyzing a corpus of read speech from American and British English speakers, this study examines correlations between derived articulatory parameters and accent strength proxies, associating systematic articulatory differences with indexed accent strength. Results indicate that tongue positioning patterns distinguish the two dialects, with notable inter-dialect differences in rhotic and low back vowels. These findings contribute to automated accent analysis and articulatory modeling for speech processing applications.
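One plausible way to compute a phoneme-level difference between a phonetic transcription and a dictionary reference is a length-normalized edit distance. The sketch below assumes Levenshtein distance as that measure; the abstract does not say which difference measure the authors use, and the function names and the ARPAbet-style example are illustrative.

```python
def phoneme_edit_distance(observed, reference):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(reference) + 1))
    for i, obs in enumerate(observed, start=1):
        curr = [i]
        for j, ref in enumerate(reference, start=1):
            cost = 0 if obs == ref else 1
            curr.append(min(
                prev[j] + 1,         # phoneme present only in observed
                curr[j - 1] + 1,     # phoneme present only in reference
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def accent_strength(observed, reference):
    """Edit distance normalized by the reference length (assumed proxy)."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return phoneme_edit_distance(observed, reference) / len(reference)


# Hypothetical example: a rhotic dictionary entry for "car" against a
# non-rhotic realization of the same word.
dictionary_reference = ["K", "AA", "R"]
phonetic_transcription = ["K", "AA"]

score = accent_strength(phonetic_transcription, dictionary_reference)
```

Here one of three reference phonemes is missing from the observed transcription, so the score is one third. A score of zero means the speaker's transcription matches the dictionary form exactly.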

artificial intelligence, correlation, machine learning, (18 more...)

2507.03149

Country: North America > United States > California (0.14)

Genre:

Research Report > New Finding (0.94)
Research Report > Experimental Study (0.89)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Training Articulatory Inversion Models for Interspeaker Consistency

McGhee, Charles, Gales, Mark J. F., Knill, Kate M.

arXiv.org Artificial Intelligence, Jun-10-2025

Acoustic-to-Articulatory Inversion (AAI) attempts to model the inverse mapping from speech to articulation. Exact articulatory prediction from speech alone may be impossible, as speakers can choose different forms of articulation seemingly without reference to their vocal tract structure. However, once a speaker has selected an articulatory form, their productions vary minimally. Recent works in AAI have proposed adapting Self-Supervised Learning (SSL) models to single-speaker datasets, claiming that these single-speaker models provide a universal articulatory template. In this paper, we investigate whether SSL-adapted models trained on single and multi-speaker data produce articulatory targets which are consistent across speaker identities for English and Russian. We do this through the use of a novel evaluation method which extracts articulatory targets using minimal pair sets. We also present a training method which can improve interspeaker consistency using only speech data.

artificial intelligence, machine learning, minimal pair, (18 more...)

2505.20529

Country:

North America > United States (0.68)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech (0.96)

Articulatory Feature Prediction from Surface EMG during Speech Production

Lee, Jihwan, Huang, Kevin, Avramidis, Kleanthis, Pistrosch, Simon, Gonzalez-Machorro, Monica, Lee, Yoonjeong, Schuller, Björn, Goldstein, Louis, Narayanan, Shrikanth

arXiv.org Artificial Intelligence, May-30-2025

We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.

artificial intelligence, machine learning, natural language, (18 more...)

2505.13814

Country: North America > United States > California (0.29)

Genre: Research Report > New Finding (0.68)

Industry:

Health & Medicine > Therapeutic Area > Neurology (0.35)
Health & Medicine > Diagnostic Medicine (0.35)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture

Pillai, Leena G, Mubarak, D. Muhammad Noorul, Sherly, Elizabeth

arXiv.org Artificial Intelligence, Apr-28-2025

Speech production is a complex sequential process that involves the coordination of various articulatory features. Among them, the tongue is a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intelligible, clear, and distinct. This paper presents a novel approach for predicting the tongue and lip articulatory features involved in given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed weights initialization. The proposed network is trained with two datasets consisting of simultaneously recorded speech and Electromagnetic Articulography (EMA) data, each introducing variations in terms of geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), Corpus Dependent (CD) and Cross Corpus (CC) modes. Experimental results indicate that the proposed model with the fixed weights approach outperformed adaptive weights initialization within a relatively small number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.

articulatory feature, artificial intelligence, machine learning, (19 more...)

2504.18099

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Acoustic to Articulatory Inversion of Speech; Data Driven Approaches, Challenges, Applications, and Future Scope

Pillai, Leena G, Mubarak, D. Muhammad Noorul

arXiv.org Artificial Intelligence, Apr-21-2025

This review focuses on data-driven approaches applied in different applications of Acoustic-to-Articulatory Inversion (AAI) of speech. It considers relevant works published in the last ten years (2011-2021). The selection criteria include: (a) type of AAI: Speaker Dependent and Speaker Independent AAI; (b) objectives of the work: articulatory approximation, articulatory feature space selection, Automatic Speech Recognition (ASR), exploring the correlation between acoustic and articulatory features, and frameworks for computer-assisted language training; (c) corpus: simultaneously recorded speech (wav) and medical imaging modalities such as ElectroMagnetic Articulography (EMA), Electropalatography (EPG), Laryngography, Electroglottography (EGG), X-ray Cineradiography, Ultrasound, and real-time Magnetic Resonance Imaging (rtMRI); (d) methods or models: only recent works are considered, and therefore all the works are based on machine learning; (e) evaluation: as AAI is a non-linear regression problem, performance is mostly evaluated by Correlation Coefficient (CC) and Root Mean Square Error (RMSE), with Mean Square Error (MSE) and Mean Format Error (MFE) also considered. The practical application of an AAI model can provide a more user-friendly, interpretable image feedback system of articulatory positions, especially tongue movement. Such a trajectory feedback system can be used to provide phonetic, language, and speech therapy for pathological subjects.
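The two evaluation measures named most often in this review, Correlation Coefficient (CC) and Root Mean Square Error (RMSE), can be sketched for a single articulator trajectory as follows. The function names and the toy trajectory are assumptions for illustration, and CC is taken to be the Pearson correlation.

```python
import math


def rmse(predicted, measured):
    """Root mean square error between two equal-length trajectories."""
    n = len(measured)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)


def correlation_coefficient(predicted, measured):
    """Pearson correlation between two equal-length trajectories."""
    n = len(measured)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    return cov / math.sqrt(var_p * var_m)


# Hypothetical tongue-tip height trajectory (mm) and a prediction that
# follows the same shape with a constant 0.5 mm offset.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

error = rmse(predicted, measured)
similarity = correlation_coefficient(predicted, measured)
```

The offset prediction reaches a correlation of 1 while its RMSE stays at 0.5 mm. The two measures capture different things: CC reflects trajectory shape, while RMSE reflects positional error.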

artificial intelligence, inversion, machine learning, (13 more...)

2504.13308

Genre:

Research Report (0.64)
Overview (0.54)

Industry: Health & Medicine > Diagnostic Medicine > Imaging (0.69)

Articulatory Encodec: Vocal Tract Kinematics as a Codec for Speech

Cho, Cheol Jun, Wu, Peter, Prabhune, Tejas S., Agarwal, Dhruv, Anumanchipalli, Gopala K.

arXiv.org Artificial Intelligence, Jun-18-2024

Vocal tract articulation is a natural, grounded control space of speech production. The spatiotemporal coordination of articulators combined with the vocal source shapes intelligible speech sounds to enable effective spoken communication. Based on this physiological grounding of speech, we propose a new framework of neural encoding-decoding of speech -- articulatory encodec. The articulatory encodec comprises an articulatory analysis model that infers articulatory features from speech audio, and an articulatory synthesis model that synthesizes speech audio from articulatory features. The articulatory features are kinematic traces of vocal tract articulators and source features, which are intuitively interpretable and controllable, being the actual physical interface of speech production. An additional speaker identity encoder is jointly trained with the articulatory synthesizer to inform the voice texture of individual speakers. By training on large-scale speech data, we achieve a fully intelligible, high-quality articulatory synthesizer that generalizes to unseen speakers. Furthermore, the speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion. To the best of our knowledge, this is the first demonstration of universal, high-performance articulatory inference and synthesis, suggesting the proposed framework as a powerful coding system of speech.

articulation, speech, synthesis, (11 more...)

2406.12998

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)
Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)

Genre: Research Report > New Finding (1.00)

Industry:

Media (0.67)
Health & Medicine (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.93)

Automating Sound Change Prediction for Phylogenetic Inference: A Tukanoan Case Study

Chang, Kalvin, Robinson, Nathaniel R., Cai, Anna, Chen, Ting, Zhang, Annie, Mortensen, David R.

arXiv.org Artificial Intelligence, Feb-2-2024

We describe a set of new methods to partially automate linguistic phylogenetic inference given (1) cognate sets with their respective protoforms and sound laws, (2) a mapping from phones to their articulatory features and (3) a typological database of sound changes. We train a neural network on these sound change data to weight articulatory distances between phones and predict intermediate sound change steps between historical protoforms and their modern descendants, replacing a linguistic expert in part of a parsimony-based phylogenetic inference algorithm. In our best experiments on Tukanoan languages, this method produces trees with a Generalized Quartet Distance of 0.12 from a tree that used expert annotations, a significant improvement over other semi-automated baselines. We discuss potential benefits and drawbacks to our neural approach and parsimony-based tree prediction. We also experiment with a minimal generalization learner for automatic sound law induction, finding it comparably effective to sound laws from expert annotation. Our code is publicly available at https://github.com/cmu-llab/aiscp.
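The idea of a weighted articulatory distance between phones can be illustrated with a small sketch. The feature inventory, the four-phone table, and the hand-set weights below are assumptions for illustration. In the paper the weights are learned by a neural network from sound change data, which this sketch does not attempt to reproduce.

```python
# Binary articulatory features for a few labial consonants (toy inventory).
FEATURES = ("voiced", "nasal", "labial", "continuant")
PHONES = {
    "p": (0, 0, 1, 0),
    "b": (1, 0, 1, 0),
    "m": (1, 1, 1, 0),
    "f": (0, 0, 1, 1),
}


def articulatory_distance(phone_a, phone_b, weights):
    """Weighted count of articulatory features on which two phones differ."""
    return sum(
        w * abs(x - y)
        for w, x, y in zip(weights, PHONES[phone_a], PHONES[phone_b])
    )


# Hypothetical weightings: uniform, and one that makes voicing changes cheap.
uniform = (1.0, 1.0, 1.0, 1.0)
cheap_voicing = (0.5, 1.0, 1.0, 1.0)

d_uniform = articulatory_distance("p", "b", uniform)
d_cheap = articulatory_distance("p", "b", cheap_voicing)
```

Under the second weighting, a change from p to b costs less than a change from p to f, so a parsimony-style search over intermediate steps would prefer voicing changes.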

chacon and list, experiment, phylogenetic inference, (16 more...)

2402.01582

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > New York > New York County > New York City (0.04)
South America > Peru (0.04)
(5 more...)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Exploring Self-supervised Pre-trained ASR Models For Dysarthric and Elderly Speech Recognition

Hu, Shujie, Xie, Xurong, Jin, Zengrui, Geng, Mengzhe, Wang, Yi, Cui, Mingyu, Deng, Jiajun, Liu, Xunying, Meng, Helen

arXiv.org Artificial Intelligence, Jun-22-2023

Automatic recognition of disordered and elderly speech remains a highly challenging task to date due to the difficulty in collecting such data in large quantities. This paper explores a series of approaches to integrate domain adapted Self-Supervised Learning (SSL) pre-trained models into TDNN and Conformer ASR systems for dysarthric and elderly speech recognition: a) input feature fusion between standard acoustic frontends and domain adapted

The associated neural speech representations produced by these pre-trained ASR systems are also inherently robust to domain mismatch [24-26]. Although they have been successfully applied to a range of normal speech processing tasks including speech recognition [21-23, 27], speech emotion recognition [28] and speaker recognition [29], very limited researches on SSL pre-trained models for disordered and elderly speech have been conducted [24, 30, 31]. Among these, wav2vec2.0

artificial intelligence, machine learning, representation, (14 more...)

2302.14564

Country: Asia > China > Hong Kong (0.05)

Genre: Research Report (0.40)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.69)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition

Hu, Shujie, Xie, Xurong, Geng, Mengzhe, Cui, Mingyu, Deng, Jiajun, Li, Guinan, Wang, Tianzi, Liu, Xunying, Meng, Helen

arXiv.org Artificial Intelligence, Jun-22-2023

Articulatory features are inherently invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition (ASR) systems designed for normal speech. Their practical application to atypical task domains such as elderly and disordered speech across languages is often limited by the difficulty in collecting such specialist data from target speakers. This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training before being cross-domain and cross-lingual adapted to three datasets across two languages: the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech corpora; and the English TORGO dysarthric speech data, to produce UTI based articulatory features. Experiments conducted on three tasks suggested incorporating the generated articulatory features consistently outperformed the baseline TDNN and Conformer ASR systems constructed using acoustic features only by statistically significant word or character error rate reductions up to 4.75%, 2.59% and 2.07% absolute (14.69%, 10.64% and 22.72% relative) after data augmentation, speaker adaptation and cross system multi-pass decoding were applied.
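The abstract reports each error rate reduction twice, as an absolute and a relative figure. The sketch below shows how the two relate. The baseline error rates it prints are back-of-the-envelope values implied by the reported pairs, not figures stated in the abstract, and they are only as precise as the rounded percentages allow. The function names are illustrative.

```python
def relative_reduction(baseline, improved):
    """Relative error rate reduction, as a fraction of the baseline."""
    return (baseline - improved) / baseline


def implied_baseline(absolute_reduction, relative_fraction):
    """Baseline error rate implied by an absolute/relative reduction pair."""
    return absolute_reduction / relative_fraction


# (absolute reduction in points, relative reduction in percent), as reported.
reported = [(4.75, 14.69), (2.59, 10.64), (2.07, 22.72)]

for absolute, relative_percent in reported:
    baseline = implied_baseline(absolute, relative_percent / 100)
    print(f"{absolute:.2f} absolute / {relative_percent:.2f}% relative "
          f"implies a baseline near {baseline:.1f}%")
```

The same absolute gain corresponds to a larger relative gain when the baseline error rate is lower. This is why the smallest absolute reduction in the abstract carries the largest relative reduction.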

articulatory feature, artificial intelligence, machine learning, (14 more...)

2206.07327

Country: Asia > China > Hong Kong (0.05)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area > Neurology (0.94)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
